Manipulating taxonomic data is a more subtle problem than it might seem at first. For example, if you want to remove a taxon, do you remove its supertaxa and subtaxa as well? What if there are sequences assigned to that taxon; are the sequences removed or reassigned to a preserved supertaxon? What if the taxon is an internal node in the taxonomy; do you connect its supertaxa and subtaxa or break the taxonomy? The answers to these questions depend on the goal of the subsetting. Metacoder uses dplyr-style functions for manipulating taxonomic data. For each dplyr verb, there are two functions in metacoder: one that manipulates the taxon portion of the data and one that manipulates the observation portion.
Often, there are many shared ranks in the taxonomic hierarchy that can make effective visualization difficult:
library(metacoder)
set.seed(2)
heat_tree(genbank_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          layout = "davidson-harel")
filter_taxa can easily remove these shared taxa by selecting the taxon that should become the new root by name and keeping its subtaxa:
set.seed(1)
filter_taxa(genbank_ex_data, name == "Basidiomycota", subtaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name,
            layout = "davidson-harel")
The filtering operation could be done this way as well:
filter_taxa(genbank_ex_data, n_supertaxa > 4)
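The n_supertaxa value used in this filter is the number of taxa above a taxon in the hierarchy. As a rough sketch of the idea (not metacoder's implementation), the count can be derived from the taxon_ids and supertaxon_ids columns by walking up to the root. The toy taxonomy and the count_supertaxa helper below are made up for illustration:

```r
# Toy taxonomy in the same two-column layout as taxon_data (made-up taxa).
taxon_data <- data.frame(
  taxon_ids      = c("1", "2", "3", "4", "5"),
  supertaxon_ids = c(NA,  "1", "2", "2", "4"),
  name           = c("Fungi", "Basidiomycota", "Agaricomycetes",
                     "Tremellomycetes", "Tremellales"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the supertaxa of one taxon by walking up
# the hierarchy until the root (a taxon whose supertaxon is NA).
count_supertaxa <- function(id, data) {
  parent <- data$supertaxon_ids[match(id, data$taxon_ids)]
  depth <- 0
  while (!is.na(parent)) {
    depth <- depth + 1
    parent <- data$supertaxon_ids[match(parent, data$taxon_ids)]
  }
  depth
}

taxon_data$n_supertaxa <- vapply(taxon_data$taxon_ids, count_supertaxa,
                                 numeric(1), data = taxon_data,
                                 USE.NAMES = FALSE)
taxon_data[, c("name", "n_supertaxa")]
```

Filtering this toy table with n_supertaxa > 1 would keep only the three taxa below Basidiomycota.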
You can also remove subtaxa below a given rank, here by limiting the number of supertaxa:
heat_tree(unite_ex_data_3,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)
filter_taxa(unite_ex_data_3, n_supertaxa <= 3) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
Or we can filter by the number of observations assigned to each taxon:
filter_taxa(unite_ex_data_3, n_obs >= 3) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name,
            tree_label = name)
This is useful when plotting very large data sets, since it is difficult to make effective visualizations of over ~2000 taxa. Note that, by default, observations assigned to removed subtaxa are reassigned to the closest supertaxon that passes the filter. You can prevent this by setting the reassign_obs option to FALSE, but when most observations are assigned to tip taxa, this is rarely useful:
filter_taxa(unite_ex_data_3, n_supertaxa < 4, reassign_obs = FALSE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
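The default reassignment of observations can be pictured as walking up the hierarchy from each observation's taxon until a taxon that passed the filter is found. The following is a minimal base R sketch of that idea under simplifying assumptions; the toy data and the closest_kept helper are hypothetical and do not reflect how metacoder implements it:

```r
# Hypothetical toy data; column names mirror taxon_data and obs_data.
taxon_data <- data.frame(
  taxon_ids      = c("1", "2", "3", "4", "5"),
  supertaxon_ids = c(NA,  "1", "1", "2", "3"),
  stringsAsFactors = FALSE
)
obs_taxon_ids <- c("4", "4", "5", "2")

# Suppose a filter keeps only the root and its two direct subtaxa.
kept <- c("1", "2", "3")

# Walk up from a taxon until reaching one that passed the filter;
# returns NA if no supertaxon passed.
closest_kept <- function(id, data, kept) {
  while (!is.na(id) && !(id %in% kept)) {
    id <- data$supertaxon_ids[match(id, data$taxon_ids)]
  }
  id
}

reassigned <- vapply(obs_taxon_ids, closest_kept, character(1),
                     data = taxon_data, kept = kept, USE.NAMES = FALSE)
reassigned
```

In this toy case the observations assigned to the removed tips "4" and "5" end up on their parents "2" and "3", so no observation is lost.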
You can also remove internal taxa:
filter_taxa(unite_ex_data_3, unite_rank != "c") %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
Note that the above result has no fungal classes anymore (taxa with names ending in “mycetes”). Like observations, subtaxa of removed taxa are reassigned to the closest supertaxon that passes the filter. Although it usually does not make much sense to not reassign taxa, it is possible:
filter_taxa(unite_ex_data_3, unite_rank != "c", reassign_taxa = FALSE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
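Reassigning subtaxa when an internal taxon is removed amounts to pointing each remaining taxon at its closest surviving ancestor. Here is a small base R sketch of that idea on a made-up four-rank lineage; the new_parent helper is hypothetical and only illustrates the concept:

```r
# Made-up lineage: kingdom > phylum > class > order.
taxon_data <- data.frame(
  taxon_ids      = c("1", "2", "3", "4"),
  supertaxon_ids = c(NA,  "1", "2", "3"),
  unite_rank     = c("k", "p", "c", "o"),
  stringsAsFactors = FALSE
)

# Taxa passing a filter like unite_rank != "c".
passed <- taxon_data$unite_rank != "c"
kept_ids <- taxon_data$taxon_ids[passed]

# Hypothetical helper: replace a parent that was filtered out with the
# closest ancestor that was kept (NA if there is none).
new_parent <- function(parent, data, kept_ids) {
  while (!is.na(parent) && !(parent %in% kept_ids)) {
    parent <- data$supertaxon_ids[match(parent, data$taxon_ids)]
  }
  parent
}

filtered <- taxon_data[passed, ]
filtered$supertaxon_ids <- vapply(filtered$supertaxon_ids, new_parent,
                                  character(1), data = taxon_data,
                                  kept_ids = kept_ids, USE.NAMES = FALSE)
filtered
```

The order (taxon "4") is now attached directly to the phylum (taxon "2"), which keeps the hierarchy connected after the class is dropped.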
Filtering observations assigned to taxa is less complicated. The code below removes all observations with seq_ids that do not start with “A”:
filter_obs(unite_ex_data_3, grepl("^A", seq_id)) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
To also remove any taxa left without observations after the filtering, set the unobserved option to FALSE:
filter_obs(unite_ex_data_3, grepl("^A", seq_id), unobserved = FALSE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
Random sampling of taxa and observations is similar to subsetting, except you provide weights to each observation or taxon indicating how likely it is to be included in the subset. The random subset of taxa or observations is then passed to filter_taxa or filter_obs respectively. Therefore, all the options of filter_taxa or filter_obs can be used within sample_n_taxa and sample_n_obs.
Sampling observations is useful for making a subset of a large data set (although the example below does not use a particularly large one):
sample_n_obs(unite_ex_data_3, size = 100, unobserved = FALSE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
Weights can be assigned to observations to determine how likely each is to be sampled:
sample_n_obs(unite_ex_data_3, size = 50, unobserved = FALSE,
             obs_weight = ifelse(grepl("Agaricales", seq_name), 100, 1)) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
You can also assign weights to observations based on the taxon they are assigned to:
sample_n_obs(unite_ex_data_3, size = 100, unobserved = FALSE,
             taxon_weight = 1 / n_obs) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
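To see why a taxon weight of 1 / n_obs evens out the sampling, it can help to work through the arithmetic on a toy vector. The sketch below uses only base R and assumes each observation simply inherits the weight of its taxon; how metacoder combines obs_weight and taxon_weight internally is not shown here, and the data are made up:

```r
set.seed(1)

# Made-up assignments: taxon "a" has 4 observations, "b" has 2, "c" has 1.
obs_taxon_ids <- c("a", "a", "a", "a", "b", "b", "c")

# Number of observations per taxon, then a weight of 1 / n_obs per taxon.
n_obs <- table(obs_taxon_ids)
taxon_weight <- 1 / n_obs

# Each observation inherits the weight of the taxon it is assigned to.
obs_prob <- as.numeric(taxon_weight[obs_taxon_ids])

# The weights of each taxon's observations now sum to 1, so every taxon is
# equally likely to supply the first draw regardless of its size.
tapply(obs_prob, obs_taxon_ids, sum)

# Draw 3 observations without replacement using those weights.
picked <- sample(seq_along(obs_taxon_ids), size = 3, prob = obs_prob)
obs_taxon_ids[picked]
```

Without the weights, the four observations of taxon "a" would dominate the sample.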
Taxa can be sampled the same way observations are sampled. The code below randomly selects 5 taxa of rank “class”:
set.seed(1)
sample_n_taxa(unite_ex_data_3, size = 5, subtaxa = TRUE,
              taxon_weight = ifelse(unite_rank == "c", 1, 0)) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name, tree_label = name)
When randomly subsetting taxa, pay special attention to the options of filter_taxa, since omitting them can have a drastic effect:
set.seed(1)
sample_n_taxa(unite_ex_data_3, size = 100) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
sample_n_taxa and sample_n_obs have simple wrappers called sample_frac_taxa and sample_frac_obs that sample a given proportion of the total number of rows:
set.seed(1)
sample_frac_obs(unite_ex_data_3, size = 0.1) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name)
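The relationship between the "frac" and "n" variants can be sketched as converting the proportion into a row count and then sampling that many rows. The helper below is hypothetical and written in base R for a plain data frame; in particular, the use of round() is an assumption, not something taken from metacoder:

```r
# Hypothetical helper, not metacoder's code: turn a fraction into a count
# and sample that many rows without replacement.
sample_frac_rows <- function(data, size) {
  n <- round(nrow(data) * size)  # rounding rule is an assumption
  data[sample(nrow(data), n), , drop = FALSE]
}

set.seed(1)
# Made-up observation table with 500 rows, like obs_data in the examples.
obs <- data.frame(seq_id = sprintf("S%03d", 1:500),
                  stringsAsFactors = FALSE)

obs_subset <- sample_frac_rows(obs, size = 0.1)
nrow(obs_subset)
```

With 500 rows and size = 0.1, the subset has 50 rows.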
Subsetting columns is more straightforward than subsetting rows. The functions select_taxa and select_obs are little more than wrappers for dplyr::select. The only thing they do differently is to enforce that the taxon_ids, supertaxon_ids, and obs_taxon_ids columns are preserved:
unite_ex_data_3
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 4
## taxon_ids supertaxon_ids unite_rank name
## <chr> <chr> <chr> <chr>
## 1 1 <NA> k Fungi
## 2 2 1 p Ascomycota
## 3 3 1 p Basidiomycota
## 4 4 1 p Chytridiomycota
## 5 5 1 p Glomeromycota
## 6 6 1 p unidentified
## 7 7 1 p Zygomycota
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
select_taxa(unite_ex_data_3, unite_rank)
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 3
## taxon_ids supertaxon_ids unite_rank
## <chr> <chr> <chr>
## 1 1 <NA> k
## 2 2 1 p
## 3 3 1 p
## 4 4 1 p
## 5 5 1 p
## 6 6 1 p
## 7 7 1 p
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
Note how the “name” column has been removed from “taxon_data”. You can also use this to reorder columns:
select_obs(unite_ex_data_3, other_id, seq_name)
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 4
## taxon_ids supertaxon_ids unite_rank name
## <chr> <chr> <chr> <chr>
## 1 1 <NA> k Fungi
## 2 2 1 p Ascomycota
## 3 3 1 p Basidiomycota
## 4 4 1 p Chytridiomycota
## 5 5 1 p Glomeromycota
## 6 6 1 p unidentified
## 7 7 1 p Zygomycota
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 3
## obs_taxon_ids other_id seq_name
## <chr> <chr> <chr>
## 1 183 SH189775.06FU Lachnum_sp
## 2 175 SH189776.06FU Lachnellula_calyciformis
## 3 183 SH189777.06FU Lachnum_sp
## 4 183 SH189778.06FU Lachnum_sp
## 5 183 SH189779.06FU Lachnum_sp
## 6 181 SH189780.06FU Lachnum_pulverulentum
## 7 183 SH189781.06FU Lachnum_sp
## # ... with 493 more rows
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
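The column-preserving behaviour of select_taxa and select_obs can be imitated on a plain data frame. The sketch below is a base R approximation with a hypothetical helper, select_keeping_ids, that always keeps the identifier columns and places them first, as in the output above; it is not how metacoder wraps dplyr::select:

```r
# Made-up stand-in for taxon_data.
taxon_data <- data.frame(
  taxon_ids      = c("1", "2", "3"),
  supertaxon_ids = c(NA,  "1", "1"),
  unite_rank     = c("k", "p", "p"),
  name           = c("Fungi", "Ascomycota", "Basidiomycota"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the requested columns, but always retain the
# identifier columns (placed first) so the hierarchy stays intact.
select_keeping_ids <- function(data, cols,
                               id_cols = c("taxon_ids", "supertaxon_ids")) {
  data[, unique(c(id_cols, cols)), drop = FALSE]
}

selected <- select_keeping_ids(taxon_data, "unite_rank")
names(selected)
```

Requesting only unite_rank still returns taxon_ids and supertaxon_ids, while the name column is dropped.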
Adding a column to either taxon_data or obs_data is easy using the dplyr syntax:
mutate_taxa(unite_ex_data_3, new_col = "Im new!")
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 5
## taxon_ids supertaxon_ids unite_rank name new_col
## <chr> <chr> <chr> <chr> <chr>
## 1 1 <NA> k Fungi Im new!
## 2 2 1 p Ascomycota Im new!
## 3 3 1 p Basidiomycota Im new!
## 4 4 1 p Chytridiomycota Im new!
## 5 5 1 p Glomeromycota Im new!
## 6 6 1 p unidentified Im new!
## 7 7 1 p Zygomycota Im new!
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
A convenient attribute of dplyr::mutate is the ability to reference newly created columns:
mutate_taxa(unite_ex_data_3,
            new_col = "Im new!",
            newer_col = gsub(pattern = "!", replacement = "er!!", new_col))
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 6
## taxon_ids supertaxon_ids unite_rank name new_col newer_col
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 <NA> k Fungi Im new! Im newer!!
## 2 2 1 p Ascomycota Im new! Im newer!!
## 3 3 1 p Basidiomycota Im new! Im newer!!
## 4 4 1 p Chytridiomycota Im new! Im newer!!
## 5 5 1 p Glomeromycota Im new! Im newer!!
## 6 6 1 p unidentified Im new! Im newer!!
## 7 7 1 p Zygomycota Im new! Im newer!!
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
Adding observation columns with mutate_obs works the same way.
transmute_taxa and transmute_obs make new columns and discard all of the old columns, except for the identifier columns, which are preserved:
transmute_taxa(unite_ex_data_3,
               new_col = "Im new!",
               newer_col = gsub(pattern = "!", replacement = "er!!", new_col))
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 4
## taxon_ids supertaxon_ids new_col newer_col
## <chr> <chr> <chr> <chr>
## 1 1 <NA> Im new! Im newer!!
## 2 2 1 Im new! Im newer!!
## 3 3 1 Im new! Im newer!!
## 4 4 1 Im new! Im newer!!
## 5 5 1 Im new! Im newer!!
## 6 6 1 Im new! Im newer!!
## 7 7 1 Im new! Im newer!!
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
Because of the way taxmap objects are defined, the order of their components does not matter. This means it is easy to reorder them to fit your needs:
arrange_taxa(unite_ex_data_3, name)
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 4
## taxon_ids supertaxon_ids unite_rank name
## <chr> <chr> <chr> <chr>
## 1 472 419 s abieticola
## 2 686 679 g Absidia
## 3 251 245 g Acremonium
## 4 249 244 g Acrostalagmus
## 5 690 681 g Actinomucor
## 6 571 557 s aff_brevipes_r_04085
## 7 477 474 s aff_fellea_PBM_2825
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
To change the direction of ordering, use dplyr::desc:
arrange_taxa(unite_ex_data_3, desc(name))
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 697, 698, 699, 700, 701, 702, 703
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 703 × 4
## taxon_ids supertaxon_ids unite_rank name
## <chr> <chr> <chr> <chr>
## 1 7 1 p Zygomycota
## 2 240 17 o Xylariales
## 3 284 240 f Xylariaceae
## 4 296 284 g Xylaria
## 5 500 492 g Xerocomus
## 6 36 32 s xanthorrhoeae
## 7 177 172 s willkommii
## # ... with 696 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 500 × 5
## obs_taxon_ids seq_name seq_id other_id
## <chr> <chr> <chr> <chr>
## 1 183 Lachnum_sp JQ347180 SH189775.06FU
## 2 175 Lachnellula_calyciformis U59145 SH189776.06FU
## 3 183 Lachnum_sp AM084756 SH189777.06FU
## 4 183 Lachnum_sp FM172814 SH189778.06FU
## 5 183 Lachnum_sp FN539058 SH189779.06FU
## 6 181 Lachnum_pulverulentum AB481260 SH189780.06FU
## 7 183 Lachnum_sp HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] metacoder_0.1.2 knitcitations_1.0.7 knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] igraph_1.0.1 Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.2-7 R6_2.2.0 bibtex_0.4.0 stringr_1.1.0
## [9] httr_1.2.1 plyr_1.8.4 dplyr_0.5.0 tools_3.3.1
## [13] gtable_0.2.0 DBI_0.5-1 htmltools_0.3.5 lazyeval_0.2.0
## [17] assertthat_0.1 yaml_2.1.13 rprojroot_1.2 digest_0.6.12
## [21] tibble_1.2 RJSONIO_1.3-0 ggplot2_2.2.1 reshape2_1.4.2
## [25] RefManageR_0.13.1 formatR_1.4 bitops_1.0-6 RCurl_1.95-4.8
## [29] evaluate_0.10 rmarkdown_1.3 labeling_0.3 stringi_1.1.2
## [33] scales_0.4.1 backports_1.0.5 XML_3.98-1.4 lubridate_1.6.0